ABSTRACT

Information drives today's businesses, and the Internet is a vast source of it. Most businesses rely on the
web to gather data that is crucial to their decision-making processes. Companies regularly collect and analyze product
specifications, pricing information, market trends and regulatory information from various websites; when performed
manually, this is often a time-consuming, error-prone process. It is therefore important to create a simple algorithm or tool
that can easily extract data from web pages and publish the extracted data in the desired form.
INTRODUCTION

• Web data mining is a form of data mining: a technique for crawling through various web resources
to collect required information. It enables an individual or a company to promote business, understand
marketing dynamics, track new promotions circulating on the Internet, etc. There is a growing trend among companies,
organizations and individuals alike to gather information through web data mining and use it in
their best interest.

• Before data mining software existed, businesses collected information from recorded
data sources. But the volume of this information is too large, and going through all the records is too daunting
and time-consuming. Computer-based data mining therefore emerged, gained
huge popularity, and has become a necessity for the survival of most businesses.
Figure 1

TYPES OF WEB DATA MINING
Web data is classified as:

• Web Content - text, images, records, etc.; e.g. extracting business information from a web site.

• Web Structure - hyperlinks, tags, etc.; e.g. finding the number of links on a particular website.

• Web Usage - HTTP logs, app server logs, etc.; e.g. finding the number of requests per day.



This article can be downloaded from www.impactjournals.us



Swati  Sandip  Patel 


• Web Content

Web Content Mining is the process of extracting useful information from the contents of the Web. The content may consist of
text, images, audio, video, or structured records such as lists and tables. Research in this field also draws on
techniques from other disciplines such as Information Retrieval (IR) and Natural Language Processing (NLP).
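
As an illustration of extracting structured records from page content, the following is a minimal sketch that pulls table rows out of an HTML fragment using only the Python standard library. The class name `TableExtractor`, the helper `extract_rows` and the sample page are assumptions made for this example; they are not part of any tool discussed in this paper.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being parsed
        self._cell = None  # text fragments of the cell currently being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return all table rows found in the given HTML text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page fragment with placeholder values.
SAMPLE_PAGE = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget A</td><td>19.99</td></tr>
  <tr><td>Widget B</td><td>24.50</td></tr>
</table>
"""

rows = extract_rows(SAMPLE_PAGE)
for row in rows:
    print(row)
```

A real page would usually need additional handling for nested tables, missing closing tags and cell attributes; the sketch only covers the simple, well-formed case.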

• Web Structure

Web Structure Mining is the process of discovering structure information from the Web, for example identifying
interesting graph patterns or preprocessing the whole web graph to compute metrics such as PageRank. This type of mining
can be performed either at the (intra-page) document level or at the (inter-page) hyperlink level. It is a useful source for
extracting information such as the quality of a web page, interesting web structures, etc.
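
To make the hyperlink-level analysis concrete, here is a minimal sketch of a PageRank-style score computed by power iteration over a small link graph. The function name `pagerank`, the damping factor of 0.85, the iteration count and the three-page graph are assumptions chosen for illustration; production implementations differ in many details.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives the same small "teleport" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a -> b, a -> c, b -> c, c -> a.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

ranks = pagerank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this small graph, page c is linked from both other pages and therefore ends up with the highest score, which matches the intuition that inbound links indicate page quality.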

• Web Usage

Web Usage Mining covers user identification, session creation, robot detection and filtering, and extraction of usage path patterns.
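
The usage-mining steps listed above can be sketched as follows. This is a simplified example under stated assumptions: log records are already parsed into `(ip, timestamp, url, user_agent)` tuples, a user is identified by IP address alone, robots are detected by a keyword match on the user agent, and a session ends after 30 minutes of inactivity. None of these choices come from the paper; they are common heuristics used here for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)       # assumed inactivity threshold
ROBOT_MARKERS = ("bot", "crawler", "spider")  # assumed user-agent keywords


def is_robot(user_agent):
    """Heuristic robot detection based on the user-agent string."""
    agent = user_agent.lower()
    return any(marker in agent for marker in ROBOT_MARKERS)


def build_sessions(records):
    """Group non-robot hits into per-user sessions (usage paths)."""
    hits_by_user = defaultdict(list)
    for ip, timestamp, url, user_agent in records:
        if not is_robot(user_agent):
            hits_by_user[ip].append((timestamp, url))

    sessions = []
    for ip, hits in hits_by_user.items():
        hits.sort()  # chronological order
        path = [hits[0][1]]
        last_seen = hits[0][0]
        for timestamp, url in hits[1:]:
            if timestamp - last_seen > SESSION_TIMEOUT:
                # Long gap: close the current session and start a new one.
                sessions.append((ip, path))
                path = []
            path.append(url)
            last_seen = timestamp
        sessions.append((ip, path))
    return sessions


# Hypothetical, already-parsed log records.
RECORDS = [
    ("10.0.0.1", datetime(2024, 1, 1, 9, 0), "/home", "Mozilla/5.0"),
    ("10.0.0.1", datetime(2024, 1, 1, 9, 5), "/products", "Mozilla/5.0"),
    ("10.0.0.2", datetime(2024, 1, 1, 9, 6), "/home", "ExampleBot/1.0"),
    ("10.0.0.1", datetime(2024, 1, 1, 10, 0), "/home", "Mozilla/5.0"),
]

for ip, path in build_sessions(RECORDS):
    print(ip, " -> ".join(path))
```

Identifying users by IP address alone is a rough approximation, since several users may share one address; cookies or login identifiers would give better results where available.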
CURRENT  TRENDS  AND  LIMITATIONS 

• Mining tools currently available on the market include Kapow, Automation Anywhere, etc., which facilitate
web mining by creating crawlers/robots/spiders/bots for different websites.

• However, all of these tools are quite expensive and have limited support for complex HTML structures.

• The available tools are not very user friendly, so it is difficult for a business or novice user to create
bots/spiders on the fly.

• The available tools require programming knowledge, which the general user typically lacks.

• The tools lack an IP rotation facility, which is very important when dealing with well-known web
sites such as Google, Facebook, etc.

• The tools are not designed to support large-scale data scraping, so scalability is also a concern when
scraping huge amounts of data from the web.

• The tools have limited ability to avoid duplicate scraping and usually get stuck in iterative jobs.
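
As a rough idea of what an IP rotation facility involves, the sketch below cycles through a pool of proxy servers in round-robin order and builds a request opener for each one. The class `ProxyRotator` and the proxy addresses are hypothetical (the addresses come from a range reserved for documentation), and the example deliberately performs no network request.

```python
import itertools
import urllib.request

# Hypothetical proxy pool; replace with real proxy endpoints.
PROXIES = [
    "http://192.0.2.10:8080",
    "http://192.0.2.11:8080",
    "http://192.0.2.12:8080",
]


class ProxyRotator:
    """Hand out proxies in round-robin order."""

    def __init__(self, proxies):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self):
        return next(self._cycle)

    def build_opener(self):
        """Return (proxy, opener); the opener routes requests via that proxy."""
        proxy = self.next_proxy()
        handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        return proxy, urllib.request.build_opener(handler)


rotator = ProxyRotator(PROXIES)
for _ in range(4):
    proxy, opener = rotator.build_opener()
    # opener.open(url) would send the request through this proxy.
    print("next request would use", proxy)
```

Round-robin is the simplest policy; a fuller implementation might also drop proxies that fail and throttle the request rate per target site.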
PROBLEM  FORMULATION  AND  CHALLENGES 

• Frequent changes in web site structure.

• JavaScript-based loading of page content (e.g. Ajax requests).

• Collaborative crawling.

• Avoiding redundant crawling (e.g. recursive references).

• Avoiding IP blockage by third-party sites (possible solution: an IP rotation program).

• Efficient fetching (hundreds of pages per second; possible solution: a thread- or task-based
architecture).
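
Two of the challenges above, redundant crawling and efficient fetching, can be sketched together: a set of normalized URLs prevents revisiting pages reached through recursive references, and a thread pool fetches each level of the crawl in parallel. The functions `normalize` and `crawl`, the `fetch_links` callback and the in-memory `SITE` map are assumptions for this example; a real crawler would download and parse pages inside `fetch_links`.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize(url):
    """Map equivalent URLs to one form so duplicates are recognized."""
    url, _fragment = urldefrag(url)  # "#section" does not change the page
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def crawl(seed, fetch_links, max_pages=100, workers=8):
    """Breadth-first crawl; fetch_links(url) returns the links found at url."""
    start = normalize(seed)
    visited = {start}
    frontier = [start]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # Fetch every page of the current level in parallel.
            results = list(pool.map(fetch_links, frontier))
            next_frontier = []
            for links in results:
                for link in links:
                    link = normalize(link)
                    if link not in visited and len(visited) < max_pages:
                        visited.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return visited


# Hypothetical site with recursive references between its pages.
SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b#top"],
    "http://example.com/a": ["http://example.com/", "http://example.com/b"],
    "http://example.com/b": ["http://example.com/a"],
}


def fetch_links(url):
    """Stand-in for downloading a page and extracting its hyperlinks."""
    return SITE.get(url, [])


pages = crawl("http://Example.com", fetch_links)
print(sorted(pages))
```

Because every URL is added to `visited` before it is queued, each page is fetched at most once even though the pages link back to each other.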
• Extract mutual fund information from a website daily.

• Extract data from one online system and transfer it to another online system.

• Scrape unstructured data from the web and transfer it to Excel.

• Regularly download updated web images of weather maps.
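
As an example of the transfer-to-Excel use case, scraped records can be written as CSV, a format that Excel opens directly. The helper `records_to_csv`, the field names and the sample values below are placeholders invented for this sketch, not real fund data.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Serialize a list of dict records to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Placeholder records standing in for scraped data.
RECORDS = [
    {"fund": "Example Fund A", "nav": "12.34"},
    {"fund": "Example Fund B", "nav": "56.78"},
]

csv_text = records_to_csv(RECORDS, fieldnames=["fund", "nav"])
print(csv_text)

# To produce a file for Excel, write the same text to disk, e.g.:
# with open("funds.csv", "w", newline="", encoding="utf-8") as handle:
#     handle.write(csv_text)
```

Running such a script on a daily schedule would cover the first use case in the list as well, with the scraping step supplying the records.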